In modern society, fitness has become an integral part of many people’s daily lives. However, there is limited in-depth data analysis on the behavioral patterns, demographic characteristics, and geographic distributions of gym-goers. This project aims to address this gap by systematically analyzing data on calorie expenditure, workout duration, and subscription plan choices among gym-goers to identify key factors influencing these behaviors. Our goal is to provide valuable insights for the fitness industry, public health researchers, and policymakers, enabling them to better understand fitness trends and develop more effective interventions and strategies to promote overall health.
What factors contribute to the calories a user burns in gyms?
What factors contribute to the workout duration and frequency of users’ gym visits?
What factors contribute to the selection of users’ subscription plans?
You can find the raw datasets here. Click the button to show the relevant code.
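# Load required packages
library(tidyverse)  # read_csv(), dplyr verbs, ggplot2
library(lubridate)  # year(), month(), hour()
library(plotly)     # ggplotly(), plot_ly(), subplot()
# Import 4 raw datasets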
users_data = read_csv("datasets/users_data.csv")
gyms_data = read_csv("datasets/gym_locations_data.csv")
history_data = read_csv("datasets/checkin_checkout_history_updated.csv")
plans_data = read_csv("datasets/subscription_plans.csv")
# Clean/tidy 4 datasets
## 1. users_data --> users_tidy:
users_tidy = users_data |>
  mutate(
    name = paste(first_name, last_name),
    membership_days = as.numeric(difftime(as.Date("2024-09-30"), as.Date(sign_up_date, format = "%Y-%m-%d"), units = "days"))
  ) |>
  select(user_id, name, age, gender, membership_days, user_location, subscription_plan)
## 2. gyms_data --> gyms_tidy:
gyms_tidy = gyms_data |>
  mutate(
    climbing_wall = ifelse(grepl("Climbing Wall", facilities), 1, 0),
    swimming_pool = ifelse(grepl("Swimming Pool", facilities), 1, 0),
    basketball_court = ifelse(grepl("Basketball Court", facilities), 1, 0),
    yoga_classes = ifelse(grepl("Yoga Classes", facilities), 1, 0),
    sauna = ifelse(grepl("Sauna", facilities), 1, 0),
    crossfit = ifelse(grepl("CrossFit", facilities), 1, 0)
  ) |>
  select(gym_id, location, gym_type, climbing_wall, swimming_pool, basketball_court, yoga_classes, sauna, crossfit)
## 3. history_data --> history_tidy
history_tidy = history_data |>
  mutate(
    workout_year = year(as.Date(checkin_time, format = "%Y-%m-%d %H:%M:%S")),
    workout_month = month(as.Date(checkin_time, format = "%Y-%m-%d %H:%M:%S")),
    workout_time = format(as.POSIXct(checkin_time, format = "%Y-%m-%d %H:%M:%S"), "%H:%M"),
    workout_duration = as.numeric(difftime(
      as.POSIXct(checkout_time, format = "%Y-%m-%d %H:%M:%S"),
      as.POSIXct(checkin_time, format = "%Y-%m-%d %H:%M:%S"),
      units = "mins"
    )),
    workout_timecat = case_when(
      hour(as.POSIXct(checkin_time, format = "%Y-%m-%d %H:%M:%S")) %in% 6:11 ~ "morning",
      hour(as.POSIXct(checkin_time, format = "%Y-%m-%d %H:%M:%S")) %in% 12:17 ~ "afternoon",
      TRUE ~ "evening"
    ),
    calories_per_min = round(ifelse(workout_duration > 0, calories_burned / workout_duration, 0), 2)
  ) |>
  select(user_id, gym_id, workout_year, workout_month, workout_time, workout_timecat, workout_type, workout_duration, calories_burned, calories_per_min)
## 4. plans_data --> plans_tidy
plans_tidy = plans_data |>
  select(subscription_plan, price_per_month)
# Merge 4 datasets to get the final dataset
## 1. users_tidy & history_tidy, by user_id
users_history_tidy = history_tidy |>
  left_join(
    users_tidy |>
      rename(user_name = name),
    by = "user_id"
  )
## 2. users_history_tidy & gyms_tidy, by gym_id
users_gyms_tidy = users_history_tidy |>
  left_join(
    gyms_tidy |>
      rename(gym_location = location),
    by = "gym_id"
  )
## 3. users_gyms_tidy & plans_tidy, by subscription_plan
final = users_gyms_tidy |>
  left_join(plans_tidy, by = "subscription_plan")
Four raw datasets were imported for analysis:
users_data: User demographic and subscription
information.
gyms_data: Details about gym locations and available
facilities.
history_data: Workout history, including
check-in/out times and calories burned.
plans_data: Subscription plans with associated
pricing.
users_data:
Created name by combining first and last
names.
Calculated membership_days from the sign-up date to
2024-09-30.
Retained essential columns: user_id,
name, age, gender,
membership_days, user_location, and
subscription_plan.
gyms_data:
Converted gym facilities into binary indicators (\(1 = present,\ 0 = absent\)) for
climbing_wall, swimming_pool,
basketball_court, yoga_classes,
sauna, and crossfit.
Retained relevant columns: gym_id,
location, gym_type, and facility
indicators.
history_data:
Extracted year and month as workout_year and
workout_month.
Calculated workout_duration (minutes) from check-in
and check-out times.
Categorized workout times into morning,
afternoon, and evening periods
(workout_timecat).
Calculated calories_per_min as calories burned
divided by workout duration.
Retained key columns: user_id, gym_id,
workout_year, workout_month,
workout_time, workout_timecat,
workout_type, workout_duration,
calories_burned, and
calories_per_min.
plans_data:
Retained subscription_plan and
price_per_month.
Step 1: Merged users_tidy with
history_tidy by user_id, renaming
name to user_name.
Step 2: Merged users_history_tidy
with gyms_tidy by gym_id, renaming
location to gym_location.
Step 3: Merged users_gyms_tidy with
plans_tidy by subscription_plan.
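Each merge step is a left join, so every check-in record is kept and the row count stays unchanged as long as the key is unique in the right-hand table. The following is a minimal sketch of how that could be checked, using small made-up tables (`history_toy` and `users_toy` are hypothetical stand-ins, not the project data):

```r
library(dplyr)

# Toy stand-ins for history_tidy and users_tidy (hypothetical values)
history_toy <- data.frame(
  user_id = c("u1", "u1", "u2", "u3"),
  gym_id = c("g1", "g2", "g1", "g1")
)
users_toy <- data.frame(
  user_id = c("u1", "u2"),
  subscription_plan = c("Basic", "Pro")
)

# A left join keeps every left-hand row; with unique keys on the right,
# the number of rows does not change
merged_toy <- history_toy |>
  left_join(users_toy, by = "user_id")
stopifnot(nrow(merged_toy) == nrow(history_toy))

# anti_join() lists left-hand rows that found no match on the right;
# those rows carry NA in the joined columns
unmatched <- history_toy |>
  anti_join(users_toy, by = "user_id")
print(unmatched)
```

Running the same two checks after each of the three steps would show whether any check-in lost its user, gym, or plan information during merging.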
The final dataset, final, contains 300,000 observations of
25 variables. Below is the codebook for
final:
| ID | Variable | Description |
|---|---|---|
| 1 | user_id | Unique identifier for each user |
| 2 | gym_id | ID of the gym where the check-in occurred |
| 3 | workout_year | Year of check-in time |
| 4 | workout_month | Month of check-in time |
| 5 | workout_time | Exact check-in time (HH:MM) |
| 6 | workout_timecat | morning (6:00-11:59); afternoon (12:00-17:59); evening (all other times, 18:00-5:59) |
| 7 | workout_type | Type of workout performed during the visit (e.g., Cardio, Weightlifting, Yoga) |
| 8 | workout_duration | Check-out time minus check-in time (minutes) |
| 9 | calories_burned | Estimated number of calories burned during the workout |
| 10 | calories_per_min | Calories burned per minute |
| 11 | user_name | User's full name |
| 12 | age | Age of the user |
| 13 | gender | Male; Female; Non-binary |
| 14 | membership_days | Total days of membership, from sign-up date to 2024-09-30 (when the datasets were last updated by the author) |
| 15 | user_location | City where the user lives |
| 16 | subscription_plan | The user's gym subscription plan (Basic, Pro, Student) |
| 17 | gym_location | Location of the gym |
| 18 | gym_type | The type of gym (Premium, Standard, Budget) |
| 19 | climbing_wall | Facility indicator: 1 = yes; 0 = no |
| 20 | swimming_pool | Facility indicator: 1 = yes; 0 = no |
| 21 | basketball_court | Facility indicator: 1 = yes; 0 = no |
| 22 | yoga_classes | Facility indicator: 1 = yes; 0 = no |
| 23 | sauna | Facility indicator: 1 = yes; 0 = no |
| 24 | crossfit | Facility indicator: 1 = yes; 0 = no |
| 25 | price_per_month | Price per month in dollars |
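The facility indicators (rows 19 to 24) are derived by matching facility names inside the free-text facilities column. As a small illustration of that technique, here is a sketch on made-up strings; the separator and exact spellings in the real facilities column are assumptions:

```r
# Hypothetical facilities strings; the real separator and spelling may differ
facilities <- c(
  "Climbing Wall, Sauna",
  "Swimming Pool",
  "Sauna, CrossFit, Yoga Classes"
)

# fixed = TRUE treats the pattern as a literal string rather than a regex;
# as.integer() turns the logical match into a 1/0 indicator
sauna <- as.integer(grepl("Sauna", facilities, fixed = TRUE))
crossfit <- as.integer(grepl("CrossFit", facilities, fixed = TRUE))

print(data.frame(facilities, sauna, crossfit))
```

Matching is case-sensitive by default, so a facility spelled "Crossfit" in the raw data would be coded as 0 by this approach.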
We create a 2D histogram to visualize the relationship between workout duration and calories burned per minute.
plot1 =
  final |>
  ggplot(aes(x = workout_duration, y = calories_per_min)) +
  # Bin both axes and colour each cell by the number of check-ins it contains
  geom_bin2d() +
  scale_fill_gradientn(colors = c("#FFFFFF", "#A3C9A8", "#F4A261", "#E76F51")) +
  labs(
    title = "2D Histogram of Workout Duration and Calories Burned per Minute",
    x = "Workout Duration (minutes)",
    y = "Calories Burned per Minute"
  )
ggplotly(plot1)
It reveals that shorter workout durations are generally associated with higher calorie burn rates, while longer durations show lower burn rates, likely reflecting pacing effects. The highest density of data points appears concentrated at higher durations and moderate calorie burn rates.
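One way to put numbers on this pattern is to summarise the burn rate within duration bins. The sketch below uses simulated data, not the project data, and the column ranges are assumptions chosen only for illustration. It also draws total calories independently of duration, which shows that a rate defined as calories_burned / workout_duration falls with duration whenever total calories do not rise in proportion to it:

```r
# Simulated stand-in for final (illustrative values only, not real results)
set.seed(1)
toy <- data.frame(workout_duration = runif(500, min = 30, max = 180))
toy$calories_burned <- runif(500, min = 300, max = 1500)
toy$calories_per_min <- round(toy$calories_burned / toy$workout_duration, 2)

# Cut duration into 30-minute bins and take the median burn rate per bin
toy$duration_bin <- cut(
  toy$workout_duration,
  breaks = seq(30, 180, by = 30),
  include.lowest = TRUE
)
rate_by_bin <- aggregate(calories_per_min ~ duration_bin, data = toy, FUN = median)
print(rate_by_bin)
```

Applying the same binning to final would indicate how much of the observed decline goes beyond this purely arithmetic effect.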
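generate_histogram <- function(data, type_name) {
  # Keep only the rows for one workout type. The argument is deliberately not
  # named workout_type: inside a filter that name would resolve to the column,
  # so every panel would show the full dataset.
  plot_ly(
    data = data[data$workout_type == type_name, ],
    x = ~calories_per_min,
    type = "histogram",
    opacity = 0.7,
    name = type_name
  )
}
# Distinct workout types found in the data
workout_types = final |>
  distinct(workout_type) |>
  pull(workout_type)
# One histogram per workout type
plots = lapply(workout_types, function(wt) generate_histogram(final, wt))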
plot2 = subplot(
  plots,
  nrows = 2,
  shareX = TRUE, shareY = TRUE
) |>
  layout(
    title = "Calories Burned Per Minute Distribution by Workout Type",
    xaxis = list(title = "Calories Burned Per Minute"),
    yaxis = list(title = "Count")
  )
plot2
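When filtering by workout type inside a helper function, one scoping detail is worth keeping in mind: subset() resolves names against the data frame's columns before looking at function arguments, so an argument that shares a column's name ends up comparing the column with itself. A minimal sketch on made-up data (the helper names `filter_shadowed` and `filter_by_type` are hypothetical):

```r
toy <- data.frame(
  workout_type = c("Cardio", "Yoga", "Cardio"),
  calories_per_min = c(9.1, 4.2, 8.7)
)

# Argument name matches the column name: the condition becomes
# column == column, which is TRUE for every row
filter_shadowed <- function(data, workout_type) {
  subset(data, workout_type == workout_type)
}

# A distinct argument name combined with explicit data$ indexing
# compares the column against the value that was passed in
filter_by_type <- function(data, type_name) {
  data[data$workout_type == type_name, , drop = FALSE]
}

print(nrow(filter_shadowed(toy, "Yoga")))  # all rows are kept
print(nrow(filter_by_type(toy, "Yoga")))   # only the Yoga row is kept
```

If per-type histograms ever look identical across panels, a filter of the first kind is a likely cause.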